Abstract

We present a mathematical construction for the restricted Boltzmann machine 
(RBM) that doesn’t require specifying the number of hidden units. In fact, the 
hidden layer size is adaptive and can grow during training. This is obtained by 
first extending the RBM to be sensitive to the ordering of its hidden units. Then, 
thanks to a carefully chosen definition of the energy function, we show that the 
limit of infinitely many hidden units is well defined. As with RBM, approximate 
maximum likelihood training can be performed, resulting in an algorithm that 
naturally and adaptively adds trained hidden units during learning. We empirically
study the behaviour of this infinite RBM, showing that its performance is
competitive with that of the RBM, while not requiring the tuning of a hidden layer
size.


1 Introduction 


Over the years, machine learning research has produced a large variety of latent
variable probabilistic models. These include mixture models, factor analysis models,
latent dynamical models, and many others. Such models usually require that
the dimensionality of the latent representation be specified and fixed during learning.
Adapting this quantity is then considered a separate process that takes the
form of model selection and is normally treated as an additional hyper-parameter
to tune.

For this reason, more recently, there has been a lot of work on extending these 
models such that the size of the representation can be treated as an adaptive 
quantity during training. These extensions, often referred to as "infinite" models,
are non-parametric in nature: their latent space is infinite with probability 1
and they can arbitrarily adapt their capacity to the training data (see Orbanz and Teh
(2010) for a brief overview).





Figure 1: Graphical model of the restricted Boltzmann machine. Interconnections
between visible units and hidden units use symmetric weights.


While most latent variable models have been extended to one or more infinite
variants, a notable exception is the restricted Boltzmann machine (RBM). The
RBM is an undirected graphical model for binary vector observations, where the
latent representation is itself a binary vector (i.e. hidden layer). The RBM (and
its extensions to non-binary vectors) has been successfully applied to a variety
of problems and data, such as images (Ranzato et al. 2010), movie user preferences
(Salakhutdinov et al. 2007), motion capture (Taylor et al. 2011), text (Dahl
et al. 2012) and many others. One explanation for the lack of literature on RBMs
with an adaptive hidden layer size comes from their undirected nature. Indeed,
undirected models tend to be less amenable to a Bayesian treatment of learning,
on which the majority of the literature on infinite models relies.

Our main contribution in this paper is thus a proposal for an infinite RBM 
which can adapt the effective number of hidden units during training. While our 
proposal is not based on a Bayesian formulation, it does correspond to the infinite 
limit of a finite-sized model and behaves in such a way that it effectively adapts 
its capacity as training progresses. 

First, we propose a finite extension of the RBM that is sensitive to the position
of each unit in its hidden layer. This is achieved by introducing a random variable
that represents the number of hidden units intervening in the RBM's energy function.
Then, thanks to the introduction of an energy cost for using each additional
unit, we show that taking the infinite limit of the total number of hidden units is
well defined. We describe an approximate maximum likelihood training algorithm
for this infinite RBM, based on (Persistent) Contrastive Divergence, which results
in a procedure where hidden units are implicitly added as training progresses. Finally,
we empirically report how this model behaves in practice and show that it
can achieve performance that is competitive with a traditional RBM on the binarized
MNIST and Caltech101 Silhouettes datasets, while not requiring the tuning of a
hyper-parameter for its hidden layer size.


2 Restricted Boltzmann Machine 

We describe the basic RBM model, which we’ll build on to derive its ordered and 
infinite versions. 

An RBM is a generative stochastic neural network composed of two layers:
visible $\mathbf{v}$ and hidden $\mathbf{h}$. These layers are fully connected to each other, while
connections within a layer are not allowed. This means each unit $v_i$ is connected
to all $h_j$ units via undirected weighted connections (Figure 1).

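Given a binary RBM with $D$ visible units and $K$ hidden units, the set of visible
vectors is $\mathcal{V} = \{0,1\}^D$, whereas the set of hidden vectors is $\mathcal{H} = \{0,1\}^K$. In an
RBM model, each configuration $(\mathbf{v}, \mathbf{h}) \in \mathcal{V} \times \mathcal{H}$ has an associated energy value
defined by the following function:

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{h}^\top \mathbf{W} \mathbf{v} - \mathbf{v}^\top \mathbf{b}^v - \mathbf{h}^\top \mathbf{b}^h \tag{1}$$

The parameters $\theta = \{\mathbf{W}, \mathbf{b}^v, \mathbf{b}^h\}$ of this model are the weights $\mathbf{W}$ ($K \times D$ matrix),
the visible unit biases $\mathbf{b}^v$ ($D \times 1$ vector) and the hidden unit biases $\mathbf{b}^h$ ($K \times 1$
vector).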

A probability distribution over visible and hidden vectors is defined in terms
of this energy function:

$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})} \tag{2}$$

with

$$Z = \sum_{\mathbf{v}' \in \mathcal{V}} \sum_{\mathbf{h}' \in \mathcal{H}} e^{-E(\mathbf{v}', \mathbf{h}')}. \tag{3}$$

We see from Equation (3) that the partition function $Z$ (normalizing constant)
is intractable, as it requires summing over all $2^{(D+K)}$ possible configurations.

The probability distribution of a visible vector is obtained by marginalizing
over all configurations of hidden vectors. One property of the RBM is that the
numerator of the marginal $P(\mathbf{v})$ is tractable:

$$P(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}' \in \mathcal{H}} e^{-E(\mathbf{v}, \mathbf{h}')} = \frac{1}{Z} e^{-F(\mathbf{v})} \tag{4}$$

$$F(\mathbf{v}) = -\mathbf{v}^\top \mathbf{b}^v - \sum_{i=1}^{K} \mathrm{soft}_+(\mathbf{W}_{i\cdot} \mathbf{v} + b^h_i) \tag{5}$$

where $\mathrm{soft}_+(x) = \ln(1 + e^x)$ and the notation $\mathbf{W}_{i\cdot}$ designates the $i$-th row of $\mathbf{W}$,
likewise $\mathbf{W}_{\cdot j}$ for its columns. This allows for an equivalent definition of the RBM
model in terms of what is known as the free energy $F(\mathbf{v})$. However, the partition
function still requires summing over all configurations of visible vectors, which is
intractable even for moderate values of $D$.
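As a sanity check on Equation (5), the following minimal NumPy sketch compares the closed-form free energy with a brute-force marginalization over all $2^K$ hidden configurations of a tiny RBM. The function names (`softplus`, `energy`, `free_energy`) and the random toy parameters are illustrative assumptions and are not taken from the authors' implementation.

```python
import itertools
import numpy as np

def softplus(x):
    # soft+(x) = ln(1 + e^x), evaluated in a numerically stable way
    return np.logaddexp(0.0, x)

def energy(v, h, W, b_v, b_h):
    # Equation (1): E(v, h) = -h^T W v - v^T b^v - h^T b^h
    return -(h @ W @ v) - v @ b_v - h @ b_h

def free_energy(v, W, b_v, b_h):
    # Equation (5): F(v) = -v^T b^v - sum_i soft+(W_i. v + b^h_i)
    return -(v @ b_v) - softplus(W @ v + b_h).sum()

# Toy model (hypothetical sizes, chosen small enough to enumerate)
rng = np.random.default_rng(0)
D, K = 4, 3
W = rng.normal(size=(K, D))
b_v = rng.normal(size=D)
b_h = rng.normal(size=K)
v = rng.integers(0, 2, size=D).astype(float)

# Brute force: F(v) = -ln sum_h exp(-E(v, h)) over all 2^K hidden vectors
all_h = [np.array(h, dtype=float) for h in itertools.product([0, 1], repeat=K)]
brute_force = -np.log(sum(np.exp(-energy(v, h, W, b_v, b_h)) for h in all_h))
closed_form = free_energy(v, W, b_v, b_h)
```

The two quantities agree because the sum over hidden vectors factorizes into one term per hidden unit, which is what makes the numerator of $P(\mathbf{v})$ cheap to evaluate.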

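RBMs can be learned as generative models, to assign high probability (i.e. low
energy) to training observations and low probability otherwise. One approach
is to minimize the average negative log-likelihood (NLL) for a set of examples
$\mathcal{D} = \{\mathbf{v}_n\}_{n=1}^{N}$:

$$f(\theta, \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} -\ln P(\mathbf{v}_n) \tag{6}$$

The gradient of this objective has a simple form, which is often referred to as the
combination of positive and negative phases:

$$\nabla_\theta f(\theta, \mathcal{D}) = \underbrace{\frac{1}{N} \sum_{n=1}^{N} \nabla_\theta F(\mathbf{v}_n)}_{\text{Positive phase}} - \underbrace{\sum_{\mathbf{v}' \in \mathcal{V}} P(\mathbf{v}') \nabla_\theta F(\mathbf{v}')}_{\text{Negative phase}} \tag{7}$$

where

$$\nabla_{\mathbf{W}} F(\mathbf{v}) = -\mathbb{E}[\mathbf{h} \mid \mathbf{v}]\, \mathbf{v}^\top = -\hat{\mathbf{h}}(\mathbf{v})\, \mathbf{v}^\top \tag{8}$$

$$\nabla_{\mathbf{b}^h} F(\mathbf{v}) = -\mathbb{E}[\mathbf{h} \mid \mathbf{v}] = -\hat{\mathbf{h}}(\mathbf{v}) \tag{9}$$

$$\nabla_{\mathbf{b}^v} F(\mathbf{v}) = -\mathbf{v} \tag{10}$$

and where $\hat{\mathbf{h}}(\mathbf{v}) = \sigma(\mathbf{W}\mathbf{v} + \mathbf{b}^h)$ with $\sigma(\cdot)$ being the sigmoid function
$\sigma(x) = \frac{1}{1 + e^{-x}}$ applied element-wise. Derivations of the partial derivatives can be found in
Appendix A.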

Intuitively, the positive phase pushes up the probability of examples coming
from our training set, whereas the negative phase lowers the probability of examples
generated by the model. Much like the partition function, the negative phase
is intractable. To overcome this we approximate the expectation under $P(\mathbf{v})$ with
an average of $S$ samples $\mathcal{S} = \{\hat{\mathbf{v}}_s\}_{s=1}^{S}$ drawn from $P(\mathbf{v})$, i.e. the model:

$$\nabla_\theta f(\theta, \mathcal{D}) \approx \underbrace{\frac{1}{N} \sum_{n=1}^{N} \nabla_\theta F(\mathbf{v}_n)}_{\text{Positive phase}} - \underbrace{\frac{1}{S} \sum_{s=1}^{S} \nabla_\theta F(\hat{\mathbf{v}}_s)}_{\text{Negative phase}} \tag{11}$$


Moreover, mini-batch training is usually employed and consists in replacing the 
positive phase average by one over a small subset of the training set, different for 
every training update. 

Sampling from $P(\mathbf{v})$ can be achieved using block Gibbs sampling, by alternating
between sampling $\mathbf{v} \sim P(\mathbf{v} \mid \mathbf{h})$ and $\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v})$. It can be done efficiently
because RBMs have no connections within a layer, meaning that hidden units are
conditionally independent given the visible units and vice versa. The conditional
distributions of a binary RBM are Bernoulli distributions with parameters

$$P(h_i = 1 \mid \mathbf{v}) = \sigma(\mathbf{W}_{i\cdot} \mathbf{v} + b^h_i) \tag{12}$$

$$P(v_j = 1 \mid \mathbf{h}) = \sigma(\mathbf{h}^\top \mathbf{W}_{\cdot j} + b^v_j) \tag{13}$$


In theory, the Markov chain should be run until equilibrium before drawing a
sample for every training update, which is highly inefficient. Thus, Contrastive
Divergence (CD) learning is often employed, where we initialize the update's Gibbs
chains to the training examples and only perform $T$ steps of Gibbs sampling
(Hinton 2002). Another approach, referred to as stochastic approximation or
Persistent CD (PCD) (Tieleman 2008), is to not reinitialize the Gibbs chains between
updates.
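To make the difference between CD and PCD concrete, here is a minimal NumPy sketch of the gradient estimate of Equation (11) using the conditionals of Equations (12) and (13). The function names and the `chains` argument are assumptions made for illustration: passing `chains=None` starts the negative chains at the data (CD), while passing the chains returned by the previous call continues them (PCD).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(V, W, b_v, b_h, rng):
    # One block Gibbs step for a batch of rows: h ~ P(h|v), then v ~ P(v|h)
    p_h = sigmoid(V @ W.T + b_h)
    H = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(H @ W + b_v)
    return (rng.random(p_v.shape) < p_v).astype(float)

def free_energy_grads(V, W, b_v, b_h):
    # Batch averages of Equations (8), (10) and (9), in that order
    H_hat = sigmoid(V @ W.T + b_h)
    n = V.shape[0]
    return -(H_hat.T @ V) / n, -V.mean(axis=0), -H_hat.mean(axis=0)

def cd_gradients(V_data, W, b_v, b_h, T, rng, chains=None):
    # CD: chains start at the data. PCD: chains are carried over between updates.
    V_neg = V_data.copy() if chains is None else chains
    for _ in range(T):
        V_neg = gibbs_step(V_neg, W, b_v, b_h, rng)
    pos = free_energy_grads(V_data, W, b_v, b_h)   # positive phase
    neg = free_energy_grads(V_neg, W, b_v, b_h)    # negative phase
    grads = tuple(p - q for p, q in zip(pos, neg)) # (grad_W, grad_b_v, grad_b_h)
    return grads, V_neg
```

A PCD training loop would keep the second return value and feed it back as `chains` on the next mini-batch, then take a gradient descent step with the returned gradients.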
Figure 2: Illustration of the ordered RBM. Since $z = 2$, only the first two hidden
units are selected.

3 Ordered Restricted Boltzmann Machine 

The model we propose is a variant of the RBM where the hidden units $\mathbf{h}$ are
ordered from left to right, with this order being taken into account by the energy
function. We refer to this model as an ordered RBM (oRBM). As shown in Figure 2,
the oRBM takes hidden unit order into account by introducing a random
variable $z$ that can be understood as the effective number of hidden units
participating in the energy. Hidden units are selected starting from the left and the
selection of each hidden unit is associated with an incremental cost in energy.

Concretely, we define the energy function of the oRBM as

$$E(\mathbf{v}, \mathbf{h}, z) = -\mathbf{v}^\top \mathbf{b}^v - \sum_{i=1}^{z} \left( h_i (\mathbf{W}_{i\cdot} \mathbf{v} + b^h_i) - \beta_i \right) \tag{14}$$

where $z$ represents the number of selected hidden units and $\beta_i$
is an energy penalty for selecting the $i$-th hidden unit. As we will see, carefully
parametrizing the per-unit energy penalty will allow us to consider the case of an
infinite pool of hidden units.

In our experiments, as we wanted the filters of each unit to be the dominating
factor in a unit being selected, we parametrized it as $\beta_i = \beta\, \mathrm{soft}_+(b^h_i)$, where $\beta$ is
a global hyper-parameter (critically, as we'll discuss later, this hyper-parameter
doesn't actually require tuning and a generic value for it works fine). Intuitively,
the penalty term acts as a form of regularization since it forces the model to avoid
using more hidden units than needed, prioritizing smaller networks.

Moreover, having the penalty depend on the hidden biases also implies that
the selection of a hidden unit (i.e. influencing the outcome of the random variable
$z$) will be mostly controlled by the values taken by the connections $\mathbf{W}$. Higher
values of the bias of a hidden unit will not increase its probability of being selected.
In other words, for the model to increase its capacity and better fit the training 
data, it will have to learn better filters. Note that alternative parametrizations 
could certainly be considered. 

As with the RBM, $P(\mathbf{v})$ is defined in terms of its energy function. For this, we
have to specify the set of legal values for $\mathbf{v}$, $\mathbf{h}$ and $z$. Since, for a given $z$, the value
of the energy does not depend on the dimensions of $\mathbf{h}$ from $z+1$ to $K$, we will assume
they are set to 0. There is thus a coupling between the value of $z$ and the legal
values of $\mathbf{h}$. We will note $\mathcal{H}_z = \{\mathbf{h} \in \mathcal{H} \mid h_k = 0 \;\forall k > z\}$ the legal values of $\mathbf{h}$ for a
given $z$. As for $z$, it can vary in $\{1, \dots, K\}$, and $\mathbf{v} \in \mathcal{V}$ as usual.

The joint probability over $\mathbf{v}$, $\mathbf{h}$ and $z$ is thus:

$$P(\mathbf{v}, \mathbf{h}, z) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h}, z)} \tag{15}$$

where

$$Z = \sum_{z'=1}^{K} \sum_{\mathbf{v}' \in \mathcal{V}} \sum_{\mathbf{h}' \in \mathcal{H}_{z'}} e^{-E(\mathbf{v}', \mathbf{h}', z')} \tag{16}$$


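As for the marginal distribution $P(\mathbf{v})$ of the oRBM model, it can also be written
in terms of a free energy. Indeed, in a derivation similar to the case of the RBM,
we can show:

$$P(\mathbf{v}) = \frac{1}{Z} \sum_{z=1}^{K} \sum_{\mathbf{h} \in \mathcal{H}_z} e^{-E(\mathbf{v}, \mathbf{h}, z)} = \frac{1}{Z} \sum_{z=1}^{K} e^{-F(\mathbf{v}, z)} \tag{17}$$

$$F(\mathbf{v}, z) = -\mathbf{v}^\top \mathbf{b}^v - \sum_{i=1}^{z} \left( \mathrm{soft}_+(\mathbf{W}_{i\cdot} \mathbf{v} + b^h_i) - \beta_i \right) \tag{18}$$

This gives us a free energy where only the hidden units have been marginalized.
We can also derive a formulation where the free energy depends only on $\mathbf{v}$:

$$P(\mathbf{v}) = \frac{1}{Z} \sum_{z=1}^{K} e^{-F(\mathbf{v}, z)} = \frac{1}{Z} e^{-F(\mathbf{v})} \quad \text{with} \quad F(\mathbf{v}) = -\ln \sum_{z=1}^{K} e^{-F(\mathbf{v}, z)} \tag{19}$$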


It should be noted that, in the oRBM, $z$ does not correspond to the number
of hidden units assumed to have generated all observations. Instead, the model
allows for different observations having been generated by a different number of
hidden units. Specifically, for a given $\mathbf{v}$, the conditional distribution over the
corresponding value of $z$ is

$$P(z \mid \mathbf{v}) = \frac{\exp(-F(\mathbf{v}, z))}{\sum_{z'=1}^{K} \exp(-F(\mathbf{v}, z'))}. \tag{20}$$
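Since $F(\mathbf{v}, z)$ in Equation (18) is a running sum over the first $z$ hidden units, all $K$ values can be obtained with a single cumulative sum, and Equation (20) is then a softmax over $z$. The sketch below illustrates this under the parametrization $\beta_i = \beta\,\mathrm{soft}_+(b^h_i)$; the function names and toy parameters are assumptions for illustration only.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def free_energy_per_z(v, W, b_v, b_h, beta):
    # Equation (18) for every z = 1..K at once, via a cumulative sum over units
    per_unit = softplus(W @ v + b_h) - beta * softplus(b_h)  # soft+(.) - beta_i
    return -(v @ b_v) - np.cumsum(per_unit)

def p_z_given_v(v, W, b_v, b_h, beta):
    # Equation (20): softmax of -F(v, z) over z, shifted by the max for stability
    neg_F = -free_energy_per_z(v, W, b_v, b_h, beta)
    unnorm = np.exp(neg_F - neg_F.max())
    return unnorm / unnorm.sum()

# Toy example (hypothetical sizes and parameters)
rng = np.random.default_rng(0)
D, K, beta = 6, 5, 1.01
W = 0.5 * rng.normal(size=(K, D))
b_v = np.zeros(D)
b_h = np.zeros(K)
v = rng.integers(0, 2, size=D).astype(float)
F_z = free_energy_per_z(v, W, b_v, b_h, beta)
p_z = p_z_given_v(v, W, b_v, b_h, beta)  # p_z[k] is P(z = k + 1 | v)
```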


As for the conditional distribution over the hidden units, given a value of $z$ it takes
the same form as for the regular RBM, except for unselected hidden units which
are forced to zero. Similarly, the distribution of $\mathbf{v}$ given a value of the hidden layer
and $z$ reflects that of the RBM:

$$P(h_i = 1 \mid \mathbf{v}, z) = \begin{cases} \sigma(\mathbf{W}_{i\cdot} \mathbf{v} + b^h_i) & \text{if } i \le z \\ 0 & \text{otherwise} \end{cases} \tag{21}$$

$$P(v_j = 1 \mid \mathbf{h}, z) = \sigma\left( \sum_{i=1}^{z} W_{ij} h_i + b^v_j \right) \tag{22}$$
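To train the oRBM, we can also rely on CD or PCD for estimating the gradients
based on Equation 11, but using $F(\mathbf{v})$ as defined in Equation 19. Defining
$\mathbf{1}_z = [1, \dots, 1, 0, \dots, 0]^\top$ (a vector whose first $z$ entries are 1) and
$\mathrm{cdf}(z \mid \mathbf{v}) = [P(z < 1 \mid \mathbf{v}), \dots, P(z < K \mid \mathbf{v})]^\top$, with
$\odot$ denoting the element-wise product, the free energy gradients are then slightly
modified as follows:

$$\nabla_{\mathbf{W}} F(\mathbf{v}) = -\mathbb{E}_{\mathbf{h}, z}[\mathbf{h} \odot \mathbf{1}_z \mid \mathbf{v}]\, \mathbf{v}^\top = -\left( \hat{\mathbf{h}}(\mathbf{v}) \odot (1 - \mathrm{cdf}(z \mid \mathbf{v})) \right) \mathbf{v}^\top \tag{23}$$

$$\nabla_{\mathbf{b}^h} F(\mathbf{v}) = -\mathbb{E}_{\mathbf{h}, z}[(\mathbf{h} - \beta \sigma(\mathbf{b}^h)) \odot \mathbf{1}_z \mid \mathbf{v}] = -\left( \hat{\mathbf{h}}(\mathbf{v}) - \beta \sigma(\mathbf{b}^h) \right) \odot (1 - \mathrm{cdf}(z \mid \mathbf{v})) \tag{24}$$

$$\nabla_{\mathbf{b}^v} F(\mathbf{v}) = -\mathbf{v} \tag{25}$$

with $\hat{\mathbf{h}}(\mathbf{v}) = \sigma(\mathbf{W}\mathbf{v} + \mathbf{b}^h)$. Derivations of the partial derivatives can be found in
Appendix A.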

Compared to the RBM, computing these gradients requires one additional
quantity: the vector of cumulative probabilities $\mathrm{cdf}(z \mid \mathbf{v})$. Fortunately, this quantity
can be efficiently computed, in $O(K)$, by first computing the required vector of
probabilities $P(z \mid \mathbf{v})$ and performing a cumulative sum.
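The following NumPy sketch shows one way the gradients of Equations (23) to (25) could be assembled from $P(z \mid \mathbf{v})$ and an exclusive cumulative sum. It assumes the parametrization $\beta_i = \beta\,\mathrm{soft}_+(b^h_i)$ and a single input vector; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_free_energy_per_z(v, W, b_v, b_h, beta):
    # -F(v, z) for z = 1..K, from Equation (18)
    return v @ b_v + np.cumsum(softplus(W @ v + b_h) - beta * softplus(b_h))

def orbm_free_energy(v, W, b_v, b_h, beta):
    # Equation (19): F(v) = -ln sum_z exp(-F(v, z))
    return -np.logaddexp.reduce(neg_free_energy_per_z(v, W, b_v, b_h, beta))

def orbm_free_energy_grads(v, W, b_v, b_h, beta):
    neg_F_z = neg_free_energy_per_z(v, W, b_v, b_h, beta)
    p_z = np.exp(neg_F_z - np.logaddexp.reduce(neg_F_z))   # P(z|v), Equation (20)
    cdf = np.cumsum(p_z) - p_z     # exclusive cumsum: entry for unit i is P(z < i | v)
    selected = 1.0 - cdf           # P(z >= i | v): probability that unit i is used
    h_hat = sigmoid(W @ v + b_h)
    grad_W = -np.outer(h_hat * selected, v)                # Equation (23)
    grad_b_h = -(h_hat - beta * sigmoid(b_h)) * selected   # Equation (24)
    grad_b_v = -v                                          # Equation (25)
    return grad_W, grad_b_h, grad_b_v
```

Because `selected` can only shrink as the unit index grows, units further to the right receive smaller updates, which matches the left-to-right training behaviour described in the text.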

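Sampling from $P(\mathbf{v})$ slightly differs from the RBM as we need to consider $z$
in the Markov chain. With the oRBM, Gibbs steps alternate between sampling
$(\mathbf{h}, z) \sim P(\mathbf{h}, z \mid \mathbf{v})$ and $\mathbf{v} \sim P(\mathbf{v} \mid \mathbf{h}, z)$. Sampling from $P(\mathbf{h}, z \mid \mathbf{v})$ is done in two
steps: $z \sim P(z \mid \mathbf{v})$ followed by $\mathbf{h} \sim P(\mathbf{h} \mid \mathbf{v}, z)$.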
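The two-step sampling scheme can be sketched as follows, using Equations (20), (21) and (22). The helper names are illustrative assumptions, and the code handles a single visible vector for clarity rather than a mini-batch.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_z_given_v(v, W, b_v, b_h, beta, rng):
    pre = W @ v + b_h
    # Step 1: z ~ P(z|v), Equation (20)
    neg_F_z = v @ b_v + np.cumsum(softplus(pre) - beta * softplus(b_h))
    p_z = np.exp(neg_F_z - np.logaddexp.reduce(neg_F_z))
    z = int(rng.choice(len(p_z), p=p_z)) + 1      # z takes values in {1, ..., K}
    # Step 2: h ~ P(h|v, z), Equation (21)
    h = (rng.random(len(pre)) < sigmoid(pre)).astype(float)
    h[z:] = 0.0                                   # unselected units are forced to 0
    return h, z

def sample_v_given_h(h, W, b_v, rng):
    # Equation (22); h is already zero beyond position z, so the sum stops at z
    p_v = sigmoid(h @ W + b_v)
    return (rng.random(len(p_v)) < p_v).astype(float)
```

One full Gibbs step is then a call to `sample_h_z_given_v` followed by `sample_v_given_h`.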

During training, what we observe is that the hidden units are each trained gradually,
in sequence, from left to right. This effect is mainly due to the multiplicative
term $(1 - \mathrm{cdf}(z \mid \mathbf{v}))$ in the hidden unit parameter updates of Equations 23 and 24,
which is monotonically decreasing in the hidden unit index. Effectively, the model is thus
growing in capacity during training, until its maximum capacity of $K$ hidden units.


4 Infinite Restricted Boltzmann Machine 


The growing behaviour of the oRBM raises the question: could we achieve a
similar effect without having to specify a maximum capacity for the model? Indeed,
while Montufar and Ay (2010) have shown that an RBM with $2^{D-1} - 1$ hidden units
is a universal approximator, a variant of the RBM that could automatically
increase its capacity until it is sufficiently high is likely to yield much smaller
models in practice.

It turns out that this is possible, by taking the limit $K \to \infty$. For this
reason, we refer to this model as the infinite RBM (iRBM).

This limit is made possible thanks to two modeling choices. The first is the
assumption that a finite (but variable!) number of hidden units have non-zero
weights and biases. This is trivial to ensure, for any optimization procedure,
using any amount of any type of weight decay (e.g. L2 or L1 regularization) on all
the weights and hidden biases. An infinite number of non-zero weights and biases
could then correspond to an infinite penalty, so no proper optimization would
ever diverge to this solution, no matter the initialization. This is guaranteed
when using L1 regularization, thanks to its sparsity inducing property. As for L2
regularization, while it could theoretically lead to an infinite number of hidden
units (e.g. if the L2 norm of the parameters associated with each hidden unit
decreases exponentially with respect to the position of the hidden unit), in practice
the floating-point precision would clip very small parameters to zero, thus yielding a finite
number of hidden units.

Figure 3: Illustration of the infinite RBM. With $z = 2$, only the first two hidden
units are currently selected. The dashed lines illustrate that there are connections
that are trained (non-zero) with the third hidden unit. All (infinitely many) hidden
units after the third have zero-valued weights, which corresponds to $l$ being equal
to 3.

The second key choice is our parametrization of the per-unit energy penalty
$\beta_i$, which will ensure that the infinite sums required in computing probabilities
will be convergent. For instance, consider the conditional $P(z \mid \mathbf{v})$:

$$P(z \mid \mathbf{v}) = \frac{\exp(-F(\mathbf{v}, z))}{Z(\mathbf{v})} = \frac{\exp(-F(\mathbf{v}, z))}{\sum_{z'=1}^{\infty} \exp(-F(\mathbf{v}, z'))} \tag{26}$$


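Let us note $l$ the number of effectively trained hidden units, i.e. where all hidden
units with index greater than $l$ have zero weights and biases. This is guaranteed to happen thanks to
the growing behaviour that ensures hidden units are ordered from left to right.
Then, we can split the normalization constant $Z(\mathbf{v})$ of Equation 26 into two parts,
split at $z = l$, as follows:

$$\begin{aligned}
Z(\mathbf{v}) &= \sum_{z=1}^{l} \exp(-F(\mathbf{v}, z)) + \sum_{z=l+1}^{\infty} \exp(-F(\mathbf{v}, z)) \\
&= \sum_{z=1}^{l} \exp(-F(\mathbf{v}, z)) + \sum_{z=l+1}^{\infty} \exp\left( -F(\mathbf{v}, l) + \sum_{i=l+1}^{z} \left( \mathrm{soft}_+(\mathbf{W}_{i\cdot} \mathbf{v} + b^h_i) - \beta_i \right) \right) \\
&= \sum_{z=1}^{l} \exp(-F(\mathbf{v}, z)) + \exp(-F(\mathbf{v}, l)) \underbrace{\sum_{z=1}^{\infty} \exp\left( (1 - \beta)\, \mathrm{soft}_+(0) \right)^{z}}_{\text{Geometric series}}
\end{aligned} \tag{27}$$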
where Equation 27 is obtained by exploiting the fact that all weights and biases
of hidden units at position $l + 1$ and higher are zero. By ensuring that $\beta > 1$, the
geometric series of Equation 27 is finite and can be analytically computed. This
in turn implies that $P(z \mid \mathbf{v})$ is tractable and can be sampled from. Following a
similar reasoning, the global partition function $Z$ can be shown to be finite (see
Appendix B), thus yielding a properly defined joint distribution for any configurations
with a finite number of non-zero weights and hidden biases.
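As an illustration of Equation 27, the sketch below computes $P(z \mid \mathbf{v})$ for the $l$ trained units together with the total probability mass of all untrained units, using the closed form $\sum_{z \ge 1} r^z = r / (1 - r)$ with $r = \exp((1 - \beta)\,\mathrm{soft}_+(0)) = 2^{1-\beta}$. It assumes the parametrization $\beta_i = \beta\,\mathrm{soft}_+(b^h_i)$ and that `W` and `b_h` store only the $l$ trained units; the function name is hypothetical.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def irbm_p_z_given_v(v, W, b_v, b_h, beta):
    # W has shape (l, D): only the l trained units are stored, later ones are zero.
    if beta <= 1.0:
        raise ValueError("beta must be greater than 1 for the series to converge")
    # -F(v, z) for z = 1..l, from Equation (18)
    neg_F_z = v @ b_v + np.cumsum(softplus(W @ v + b_h) - beta * softplus(b_h))
    # Ratio of the geometric series: each zero unit multiplies the term by r < 1
    r = np.exp((1.0 - beta) * softplus(0.0))
    # ln[ exp(-F(v, l)) * r / (1 - r) ]: log-mass of all z > l
    log_tail = neg_F_z[-1] + np.log(r) - np.log1p(-r)
    log_terms = np.append(neg_F_z, log_tail)
    probs = np.exp(log_terms - np.logaddexp.reduce(log_terms))
    return probs[:-1], probs[-1]   # P(z = 1..l | v) and P(z > l | v)
```

With a value of $\beta$ close to 1 the ratio $r$ is close to 1, so a substantial share of the mass can sit in the tail, which is what lets the model propose new hidden units.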

One could think that, compared to a regular RBM, we have merely traded
the hyper-parameter of the hidden layer size for the hyper-parameter $\beta$. However,
crucially, $\beta$'s role is only to ensure that the iRBM is properly defined, and
the penalty it imposes in the energy function can be compensated by the learned
parameters. The extent to which the parameters can grow enough to compensate
for that penalty is then controlled by the strength of weight decay, a hyper-parameter
the iRBM shares with the RBM. We've thus effectively removed one
hyper-parameter. Moreover, we've indeed observed that results are robust to the
choice of $\beta$, that is, finely tuning $\beta$ was not necessary to ultimately achieve
good performance. While the choice of $\beta$ can impact the number of epochs it
would take for the weights to compensate for the penalty, this (the number of
epochs) is a quantity that must be tuned anyway, even in regular RBMs.

The question of the identifiability of the binary RBM is a complex one, which
has been studied (Cueto et al. 2010). Unlike the RBM, the iRBM is sensitive
to the ordering of its hidden units, thanks to the penalty term. This means
permutations of the iRBM's hidden units do not correspond to the same distribution,
making its parametrization more identifiable.

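As for learning, it can be done mostly by following the procedure of the oRBM,
i.e. minimizing the NLL with stochastic gradient descent using (Persistent) CD to
approximate the gradients. One slight modification is required however. Indeed,
since the free energy gradient for the hidden weights and biases can be non-zero
for all (infinitely many) hidden units, we cannot use the gradients of Equations 23 and 24
for all hidden units.

To avoid this issue, we consider the following observation. Instead of using
the derivative of $F(\mathbf{v})$, we could instead use the derivative of $F(\mathbf{v}, z)$, where $z$ is
obtained by sampling from $P(z \mid \mathbf{v})$:

$$\nabla_{\mathbf{W}} F(\mathbf{v}, z) = -\mathbb{E}_{\mathbf{h}}[\mathbf{h} \odot \mathbf{1}_z \mid z, \mathbf{v}]\, \mathbf{v}^\top = -\left( \hat{\mathbf{h}}(\mathbf{v}) \odot \mathbf{1}_z \right) \mathbf{v}^\top \tag{28}$$

$$\nabla_{\mathbf{b}^h} F(\mathbf{v}, z) = -\mathbb{E}_{\mathbf{h}}[(\mathbf{h} - \beta \sigma(\mathbf{b}^h)) \odot \mathbf{1}_z \mid z, \mathbf{v}] = -\left( \hat{\mathbf{h}}(\mathbf{v}) - \beta \sigma(\mathbf{b}^h) \right) \odot \mathbf{1}_z. \tag{29}$$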


In this case, all weights and biases with an index greater than the sampled $z$
have a gradient of zero, i.e. do not require any update. Moreover, the expectation
of these gradients with respect to $z$ (conditioned on $\mathbf{v}$) is the gradient of $F(\mathbf{v})$,
making them unbiased in this respect. This comes at the cost of higher variance
in the updates. But thanks to this observation, we are justified in using a hybrid
approach, where we use the $F(\mathbf{v})$ gradients only for the units with index less than or
equal to $l$, and "use" the gradient of $F(\mathbf{v}, z)$ for the other units, i.e. leave them
set to zero.
As previously mentioned, we use weight decay to ensure that the number of
non-zero parameters cannot diverge to infinity. For practical reasons, our implementation
also used a capacity-limiting heuristic. If the Gibbs sampling chain
ever sampled a value for $z$ that is greater than $l$, then we clamped it to $l + 1$.
Intuitively, this corresponds to "adding" a single hidden unit. This avoids filling
all the memory in the (unlikely) event where we'd draw a large value for $z$. When
adding a hidden unit, its associated weights and biases are initialized to zero.
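The capacity-limiting heuristic can be sketched in a few lines. This is only an illustration of the clamping rule described above, with a hypothetical function name and a dense parameter layout in which row $i$ of `W` holds the weights of hidden unit $i$.

```python
import numpy as np

def grow_if_needed(z, W, b_h):
    # l is the number of hidden units currently stored (trained or just added)
    l = W.shape[0]
    if z > l:
        z = l + 1                                       # clamp the sampled z to l + 1
        W = np.vstack([W, np.zeros((1, W.shape[1]))])   # new unit starts with zero weights
        b_h = np.append(b_h, 0.0)                       # and a zero hidden bias
    return z, W, b_h
```

Any per-parameter optimizer state (for example accumulated squared gradients) would need to be extended in the same way whenever a unit is added.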

We emphasize that these were not required to avoid divergence (weight decay is
sufficient): it merely ensured a practical and efficient implementation of the model
on the GPU. Note also that when using L1 regularisation, $l$ can decrease in value,
thanks to the sparsity promoting property of the L1 norm. Again, we highlight
that while a finite number of weights and biases is maintained, that number of
such weights does vary and is learned, while the implicit number of hidden units
is indeed infinite (infinitely many contribute to the partition function).


5 Related Work 


This work falls within the research literature on discovering extensions of the
original RBM model to different contexts and objectives. Of note here is the
implicit mixture of RBMs (Nair and Hinton, 2008). Indeed, the oRBM can be
interpreted as a special case of an implicit mixture of RBMs. Writing $P(\mathbf{v})$ as
$\sum_{z=1}^{K} P(z) P(\mathbf{v} \mid z)$ we see that the oRBM is an implicit mixture of $K$ RBMs, where
each RBM has a different number of hidden units (from 1 to $K$) and the weights
are tied between RBMs. The prior $P(z)$ represents the probability of using the
$z$-th RBM and is also derived from the energy function. However, as in the implicit
mixture of RBMs, $P(z)$ is intractable as it would require the value of the partition
function. That said, the work of Nair and Hinton (2008) is otherwise very different
and did not address the question of having an RBM with adaptive capacity.


Another related work is that of the Cardinality RBMs proposed by Swersky
et al. (2012). They used a cardinality potential to control the sparsity of the
RBM, i.e. limiting the number of hidden units that can be active. In the oRBM
and the iRBM, $z$ effectively acts as an upper bound on the number of hidden
units $h_i$ that can be equal to 1, since we are limiting $\mathbf{h}$ to be in $\mathcal{H}_z$, a subset
of $\mathcal{H}$. In their work, Swersky et al. (2012) use cardinality potentials that allow
only configurations having at most $k$ active hidden units. One difference with our
work however is that their cardinality potential is order agnostic, meaning that
the active hidden units can be positioned anywhere within the hidden layer while
still satisfying the cardinality potential. On the other hand, in the oRBM, all
units with index higher than $z$ must be set to zero, with only the previous hidden
units being allowed to be active. In addition, their parameter $k$ is fixed during
training whereas our number of active hidden units $z$ changes depending on the
input.

The oRBM also bears some similarity with autoencoders trained by a nested
version of dropout (Rippel et al., 2014). Nested dropout works by stochastically
selecting the number of hidden units used to reconstruct an input example at
training time, and does so independently for each update and example. Rippel et al.
(2014) showed that this defines a learning objective that makes the solution identifiable
and no longer invariant to hidden unit permutation. In addition to being
concerned with a different type of model, this work doesn't discuss the case of an
unbounded and adaptive hidden layer size.


Welling et al. (2003) proposed a self-supervised boosting approach, which is
applicable to the RBM and in which hidden units are sequentially added and
trained. However, like boosting in general and unlike the iRBM, this procedure
trains each hidden unit greedily instead of jointly, which could lead to much larger
networks than necessary. Moreover, it is not easily generalizable to online learning.

While the work on unsupervised neural networks with adaptive hidden layer
size is otherwise relatively scarce, there's been much more work in the context of
supervised learning. There is the well known work of Fahlman and Lebiere (1990)
on Cascade-Correlation networks. More recently, Zhou et al. (2012) proposed a
procedure for learning discriminative features with a denoising autoencoder (a
model related to the RBM). The procedure is also applicable to the online setting.
It relies on invoking two heuristics that either add or merge hidden units
during training. We note that the iRBM framework could easily be generalized
to discriminative and hybrid training as in Zhou et al. (2012). The corresponding
mechanisms for adding and merging units would then be implicitly derived from
gradient descent on the corresponding supervised training objective.

Finally, we highlight that our model is not based on a Bayesian formulation,
unlike most of the literature on infinite models. On the other hand, it does correspond
to the infinite limit of a finite-sized model and yields a model that can learn its
size with training.


6 Experiments 


We compare the performance of the oRBM and the iRBM with the classic RBM.
Since our objective with the iRBM is to effectively remove a hyper-parameter of
the RBM, instead of achieving improved performance, we focus our comparison
on this baseline.


1 http://github.com/MarcCote/iRBM 
All NLL results of this section were obtained by estimating the log-partition
function $\ln Z$ using Annealed Importance Sampling (AIS) (Salakhutdinov and
Murray, 2008) with 100,000 intermediate distributions and 5000 chains. As an
additional validation step, samples were generated from the best models and visually
inspected.

Each model was trained with mini-batch stochastic gradient descent using a
batch size of 64 examples and using PCD with 10 Gibbs steps between parameter
updates. We used the ADAGRAD stochastic gradient update (Duchi et al., 2011),
a per-dimension learning rate method, to train the oRBMs and the iRBMs. We
found that having different learning rates for different hidden units was very beneficial,
since units positioned earlier in the hidden layer will approach convergence
faster than units to their right, and thus will benefit from a learning rate decaying
more rapidly. We tried several learning rates $lr \in \{5 \times 10^{-1}, 10^{-1}, 5 \times 10^{-2}, 10^{-2}\}$ and
always set ADAGRAD's epsilon parameter to $10^{-6}$.
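For readers unfamiliar with ADAGRAD, a minimal sketch of a per-dimension update is given below. It follows the commonly used form of the rule, in which each parameter's step is divided by the square root of its accumulated squared gradients. Where exactly epsilon enters (inside or outside the square root) differs between implementations, so its placement here is an assumption, as is the function name.

```python
import numpy as np

def adagrad_update(param, grad, accum, lr, eps=1e-6):
    # accum holds the running sum of squared gradients, one entry per parameter
    accum = accum + grad ** 2
    # Parameters with a large gradient history get a smaller effective learning rate
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```

Because the accumulator is per-dimension, hidden units that started training earlier have accumulated more gradient history and therefore take smaller steps, which is the behaviour the paragraph above describes as beneficial.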

We also tested different values for both the L1 and L2 regularization factor
$\lambda \in \{0, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. Note that we allow the iRBM to shrink only
if L1 regularization is used.

We did try varying the $\beta$ found in the penalty term and as expected we've
found results to be robust to its value. Since $\beta$ must be greater than 1, we
explored positive constants to add to 1, on a log scale (1, 0.25, 0.1, 0.01, 0.001,
etc.). We settled on using $\beta = 1.01$ for all experiments as it provides a penalty
high enough to have a growing behavior and requires around five hundred epochs
for the weights to compensate for the penalty.

Finally, we note that improved performance could certainly have been achieved
using an improved sampler (e.g. parallel tempering (Desjardins et al., 2010)) or
parametrization (e.g. enhanced gradient parametrization (Cho et al. 2013)). However,
these changes would equally improve the baseline RBM, so we decided to
concentrate on this more common learning setup.


6.1 Binarized MNIST 


The MNIST dataset² is composed of 70,000 images of size 28x28 pixels representing
handwritten digits (0-9). Images have been stochastically binarized according
to their pixel intensity as in Salakhutdinov and Murray (2008). We use the same
split as in Larochelle and Murray (2011), corresponding to 50,000 examples for
training, 10,000 for validation and 10,000 for testing.


Each model was trained up to 5000 epochs but we performed AIS evaluation
every 1000 epochs and kept the model having the best NLL approximation on the
validation set. We report the associated NLL approximations obtained on the test set.
Taking after past studies assessing RBM results on binarized MNIST, we fixed
the number of hidden units to 500 for the RBM and the oRBM. Best results for
the RBM, oRBM and iRBM are reported in Table 1. The oRBM and the iRBM
models reach competitive performance compared to the RBM. Samples from all
three models are illustrated in Figure 5.

2 http://yann.lecun.com/exdb/mnist
Table 1: Average NLL on binarized MNIST test set for the best RBMs, oRBM and
iRBM. Partition functions were estimated using AIS with 100,000 intermediate
distributions and 5000 chains. The confidence interval on the average NLL assumes
$\ln Z$ has no variance and reflects the confidence of a finite sample average. Taking
the uncertainty about the partition function into account would make the interval
larger.





Binarized MNIST

Model   Size   ln Z      ln(Z ± 3σ)            Avg. NLL
RBM     100    600.92    [600.88, 600.95]      98.17 ± 0.52
RBM     500    613.28    [613.24, 613.31]      86.50 ± 0.44
RBM     2000   1099.07   [1098.94, 1099.17]    85.03 ± 0.42
oRBM    500    40.06     [39.90, 40.19]        88.15 ± 0.46
iRBM    1208   40.32     [40.03, 40.54]        85.65 ± 0.44
Figure 4: Comparing the filters of (a) an RBM and (b) an iRBM, both trained on
binarized MNIST. The first 96 filters are shown starting from the top-left corner
and incrementing across columns first.

Figure 5: Comparison between data from binarized MNIST and random samples
generated from the three models by randomly initializing visible units and running
10,000 Gibbs steps. The RBM and oRBM both have 500 hidden units, whereas
the iRBM final size is 1208 hidden units.

Figure 6: Each row shows a plot of $P(z \mid \mathbf{v})$ where $\mathbf{v}$ is a given example from
the MNIST test set and is displayed to the left. The first row illustrates the impact
of a noisy image on sampling $z$. As explained in Section 3, we see
that different input images are related to different values for the number $z$ of used
units.

Figure 7: (Bottom) Top 10 inputs from the test set with highest value of
$P(z \mid \mathbf{v})$ within different intervals for $z$, i.e. $\arg\max_{\mathbf{v}} P(a \le z < b \mid \mathbf{v})$ for different
intervals $[a, b)$. Interestingly, bolder inputs seem to be related to bigger values
for the number $z$ of used units. Also, simpler characters (e.g. "ones") tend to
favor smaller values of $z$ compared to more complex characters. (Top) Average
of $P(a \le z < b \mid \mathbf{v})$ over the top 10 inputs. Low values highlight regions in the
hidden layer where the hidden units are only useful when taken together with
hidden units further right in the layer.

The best RBM (500 hidden units) was trained without any regularization and
with lr = $10^{-2}$ for 5000 epochs. We used our own implementation to train the RBM,
which is why our result differs slightly from the one reported by Salakhutdinov
and Murray (2008). The difference can be explained by the fact that they used the
full set of 60,000 training images, instead of a 50,000-image subset. They also used a
custom schedule to gradually increase the number of CD steps during training. That said,
the oRBM and the iRBM would probably also benefit from having more training
data and an improved sampling strategy.

The best oRBM (500 hidden units) was trained without any regularization
and with lr = $3 \times 10^{-2}$ for 500 epochs. After 3000 epochs, the best iRBM had 1208
hidden units with non-zero weights. It was trained with L1 regularization using a
regularization factor of $\lambda = 10^{-4}$ and lr = $5 \times 10^{-2}$.

To show that our best iRBM does find an appropriate number of hidden units,
we compared it with two other RBMs having respectively 100 and 2000 hidden
units. Both were trained for 5000 epochs without any regularization, with
lr = $10^{-1}$ and lr = $10^{-2}$ respectively. Results are reported in Table 1, where we can
see that the oRBM and the iRBM still achieve competitive results compared to the
RBM with 2000 hidden units.
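
The quantities reported in the tables can be illustrated with a short NumPy sketch. It is not the evaluation code used for the experiments: the function names, the per-chain log importance weights `log_w`, the base-distribution term `log_z0` and the 1.96 multiplier (a 95% normal interval) are all assumptions made for illustration. It relies only on the identity -ln p(v) = F(v) + ln Z and on treating ln Z as exact when computing the interval on the average NLL.

```python
import numpy as np


def log_mean_exp(log_w):
    """Numerically stable ln(mean(exp(log_w)))."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))


def ais_log_z_interval(log_w, log_z0, n_sigma=3.0):
    """Return ln Z_hat and (ln(Z_hat - n_sigma*s), ln(Z_hat + n_sigma*s)).

    log_w  : log importance weights, one per AIS chain (hypothetical input)
    log_z0 : log partition function of the AIS base distribution
    s      : standard deviation of the mean importance weight
    """
    log_w = np.asarray(log_w, dtype=float)
    log_ratio = log_mean_exp(log_w)
    rel = np.exp(log_w - log_ratio)            # weights rescaled to have mean 1
    rel_std = np.std(rel, ddof=1) / np.sqrt(rel.size)
    log_z = log_z0 + log_ratio
    lower_arg = 1.0 - n_sigma * rel_std
    lower = log_z + np.log(lower_arg) if lower_arg > 0 else -np.inf
    upper = log_z + np.log1p(n_sigma * rel_std)
    return log_z, (lower, upper)


def average_nll(free_energies, log_z, z_score=1.96):
    """Average NLL and its half-width, treating ln Z as if it had no variance.

    free_energies : F(v) for each test example
    z_score       : assumed normal quantile (1.96 corresponds to 95%)
    """
    nll = np.asarray(free_energies, dtype=float) + log_z   # -ln p(v) = F(v) + ln Z
    half_width = z_score * np.std(nll, ddof=1) / np.sqrt(nll.size)
    return nll.mean(), half_width
```

Working with rescaled weights keeps the computation stable even when ln Z is in the hundreds or thousands, as in the tables.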

Figure 4 shows the ordering effect on the filters obtained with an iRBM. The
ordering is even more apparent when observing the hidden unit filters during
training. We generated a video of this visualization, illustrating the filter values
and the generated negative samples at epochs 1, 10, 50 and 100. See link:
http://goo.gl/LGQDal

Interestingly, we’ve observed that Gibbs sampling can mix much more slowly
with the oRBM. The reason is that the addition of the variable z increases the dependence
between states and thus hurts the convergence of Gibbs sampling. In particular,
we observed that when the Gibbs chain is in a state corresponding to a noisy image
without any structure, it can require many steps before stepping out of this region
of the input space. Yet, comparing the free energy of such random images and
images that resemble digits confirmed that these random images have significantly
higher free energy (and thus are unlikely samples of the model). Figure 6 also
confirms the high dependence between z and v: the distribution of the unstructured
image is peaked at z = 1, while all digits prefer values of z greater than 250. To
fix this issue, we’ve found that simply initializing the Gibbs chain to z = K was
sufficient. We used this when sampling from a trained oRBM model.
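
As a rough illustration of the heuristic, the following NumPy sketch runs a Gibbs chain for an oRBM whose first step uses z = K instead of a sampled z. It is a sketch under assumptions rather than the implementation used here: the function names, the array layout (`W` of shape K × D) and the exact point at which z = K is imposed are hypothetical. The conditionals follow from the free energy F(v, z) recalled in Appendix A, where only the first z hidden units take part in the energy.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softplus(x):
    return np.logaddexp(0.0, x)


def sample_z(v, W, b_h, beta, rng):
    """Sample z in 1..K from P(z|v), which is proportional to exp(-F(v, z))."""
    # The running sum over units equals -F(v, z) up to the z-independent term v.b_v.
    neg_f = np.cumsum(softplus(W @ v + b_h) - beta * softplus(b_h))
    p = np.exp(neg_f - neg_f.max())
    p /= p.sum()
    return int(rng.choice(p.size, p=p)) + 1


def gibbs_chain(W, b_h, b_v, beta, n_steps, rng, init_z_to_K=True):
    """Run n_steps of Gibbs sampling from randomly initialized visible units."""
    K, D = W.shape
    v = (rng.random(D) < 0.5).astype(float)
    z = K
    for step in range(n_steps):
        # Heuristic (assumed placement): keep z = K on the very first step.
        if step > 0 or not init_z_to_K:
            z = sample_z(v, W, b_h, beta, rng)
        # Sample the z active hidden units, then the visible units.
        h = (rng.random(z) < sigmoid(W[:z] @ v + b_h[:z])).astype(float)
        v = (rng.random(D) < sigmoid(b_v + h @ W[:z])).astype(float)
    return v, z
```

Starting with all K units active lets the first visible sample be driven by the full set of filters, which is one plausible reading of why the chain leaves the unstructured region faster.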

The iRBM doesn’t seem to suffer as much from a low mixing rate and thus
doesn’t require the z = K initialization heuristic for sampling. In fact, using the
heuristic when sampling from an iRBM has almost no impact on the final samples
when running 10,000 Gibbs steps. This could be an artefact of the model being
trained progressively, i.e. we only add one hidden unit when sampling a value
of z larger than l. Understanding how the lower mixing rate affects the proposed
models, and whether a heuristic such as the one mentioned earlier could be used to
improve training, is a topic left for future work.

We’ve also investigated what kind of inputs maximize P(z|v), for different
values of z. Using our best iRBM model trained with L1 regularization, we
generated Figure 7. It highlights the fact that P(z|v) does capture some structure

Table 2: Average NLL on the CalTech101 Silhouettes test set estimated using AIS
with 100,000 intermediate distributions and 5000 chains. The confidence interval
on the average NLL assumes ln Z has no variance and reflects the confidence of a
finite sample average. By taking the uncertainty about the partition function into
account, the interval would be larger.





CalTech101 Silhouettes

Model   Size   ln Z      ln(Z ± 3σ)            Avg. NLL
RBM     100    2512.20   [2511.62, 2512.56]    177.37 ± 2.81
RBM     500    2385.91   [2385.68, 2386.10]    119.05 ± 2.27
RBM     2000   3353.47   [3349.85, 3354.15]    118.29 ± 2.25
oRBM    500    1782.96   [1782.88, 1783.02]    114.99 ± 1.97
iRBM    915    2000.08   [1999.93, 2000.22]    121.47 ± 2.07

Figure 8: Comparison between data from CalTech101 Silhouettes and random
samples generated from the three models by randomly initializing visible units and
running 10,000 Gibbs steps. The RBM and oRBM both have 500 hidden units,
whereas the iRBM final size is 915 hidden units.


about the data, as the identity of the character with the highest P(z|v) varies between
different values of z.
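
To make the analysis of P(z|v) concrete, here is a minimal NumPy sketch of how the distribution over z and the interval mass P(a ≤ z < b | v) could be computed for a model with K hidden units. The function names and array shapes are illustrative assumptions, not the code used to produce the figures. It uses the fact that P(z|v) is proportional to exp(-F(v, z)), so the term that depends only on v cancels in the normalization.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)


def p_z_given_v(v, W, b_h, beta):
    """Return P(z|v) for z = 1..K as a softmax over -F(v, z)."""
    # Entry k of the running sum is -F(v, k + 1), up to the z-independent term v.b_v.
    neg_f = np.cumsum(softplus(W @ v + b_h) - beta * softplus(b_h))
    p = np.exp(neg_f - neg_f.max())
    return p / p.sum()


def top_inputs_for_interval(V, W, b_h, beta, a, b, top=10):
    """Indices of the `top` rows of V with the largest P(a <= z < b | v)."""
    # z is 1-indexed, so the interval [a, b) maps to array slice [a - 1, b - 1).
    mass = np.array([p_z_given_v(v, W, b_h, beta)[a - 1:b - 1].sum() for v in V])
    order = np.argsort(-mass)[:top]
    return order, mass[order]
```

Calling `top_inputs_for_interval` once per interval [a, b) gives, for each interval, the inputs that concentrate the most probability mass there.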


6.2 CalTech101 Silhouettes


The CalTech101 Silhouettes dataset (Marlin et al.) is composed of 8,671 images
of size 28×28 binary pixels, representing object silhouettes (101 classes). The
dataset is divided into three subsets: 4,100 examples for training, 2,264 for validation
and 2,307 for testing.

Following a protocol similar to the one used for MNIST, each model was trained
for up to 5000 epochs, with AIS evaluation performed every 1000 epochs. We report the
NLL approximations obtained on the test set. The best results for the RBM, oRBM
and iRBM are reported in Table 2. Again, the oRBM and the iRBM models reach


^http://people.cs.umass.edu/~marlin/data.shtml 
competitive performance compared to the RBM. Samples from all three models
are illustrated in Figure 8.

The best RBM (500 hidden units) was trained without any regularization and
with lr = $10^{-2}$ for 3000 epochs. We used our own implementation to train the RBM.
The best oRBM (500 hidden units) was trained with L1 regularization using a
regularization factor of $\lambda = 10^{-3}$ and lr = $10^{-2}$ for 5000 epochs. After 4000 epochs,
the best iRBM had 915 hidden units with non-zero weights. It was trained with
L1 regularization using a regularization factor of $\lambda = 10^{-3}$ and lr = $5 \times 10^{-2}$.

Again, to show that our best iRBM does find an appropriate number of hidden
units, we compared it with two other RBMs having respectively 100 and 2000
hidden units. Both were trained without any regularization, respectively with
lr = $10^{-1}$ for 5000 epochs and lr = $10^{-2}$ for 2000 epochs. Results are reported
in Table 2, where we can see that the oRBM and the iRBM still achieve competitive
results compared to the RBM with 2000 hidden units.


Conclusion 

We proposed a novel extension of the RBM, the infinite RBM, which obviates 
the need to specify the hidden layer size. The iRBM is derived from the ordered 
RBM by taking the infinite limit of its hidden layer size. We presented a training 
procedure, derived from Contrastive Divergence, such that training the iRBM 
yields a learning procedure where the effective hidden layer size can grow. 

In future work, we are interested in generalizing the idea of a growing latent
representation to structures other than a flat vector representation. We are
currently exploring extensions of the RBM allowing for a tree-structured latent
representation. We believe a similar construction, involving a similar z random
variable, should allow us to derive a training algorithm that also learns the latent
representation’s size.

A Partial derivatives


Partial derivatives related to the RBM 

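Recall equation (5) representing the free energy of the RBM:

$$F(\mathbf{v}) = -\mathbf{v}^\top \mathbf{b}^v - \sum_{i=1}^{K} \mathrm{soft}_+(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h)$$

Taking the partial derivatives of F(v) w.r.t. $W_{ij}$, $b_i^h$ and $b_j^v$ respectively, we obtain
the following:

$$\frac{\partial F(\mathbf{v})}{\partial W_{ij}} = -\sum_{k=1}^{K} \frac{\partial\, \mathrm{soft}_+(\mathbf{W}_{k\cdot}\mathbf{v} + b_k^h)}{\partial W_{ij}} = -\sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h)\, \frac{\partial\, \mathbf{W}_{i\cdot}\mathbf{v}}{\partial W_{ij}} = -\sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h)\, v_j \tag{30}$$

$$\frac{\partial F(\mathbf{v})}{\partial b_i^h} = -\sum_{k=1}^{K} \frac{\partial\, \mathrm{soft}_+(\mathbf{W}_{k\cdot}\mathbf{v} + b_k^h)}{\partial b_i^h} = -\sum_{k=1}^{K} \sigma(\mathbf{W}_{k\cdot}\mathbf{v} + b_k^h)\, \frac{\partial b_k^h}{\partial b_i^h} = -\sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h) \tag{31}$$

$$\frac{\partial F(\mathbf{v})}{\partial b_j^v} = -\frac{\partial\, \mathbf{v}^\top \mathbf{b}^v}{\partial b_j^v} = -v_j \tag{32}$$

where $\sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h)$ can be expressed as a conditional expectation over $h_i$ using
equation (12):

$$\sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h) = P(h_i = 1 \mid \mathbf{v}) = \sum_{h_i \in \{0,1\}} P(h_i \mid \mathbf{v})\, h_i = \mathbb{E}[h_i \mid \mathbf{v}]$$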
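
The three derivatives above can be sanity-checked numerically. The sketch below is only an illustration, with hypothetical function names and a central finite-difference helper: it implements the RBM free energy and the analytic gradients of equations (30), (31) and (32), so that the two can be compared on random parameters.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softplus(x):
    return np.logaddexp(0.0, x)


def free_energy(v, W, b_h, b_v):
    """F(v) = -v.b_v - sum_i softplus(W_i. v + b_i^h)."""
    return -v @ b_v - softplus(W @ v + b_h).sum()


def free_energy_grads(v, W, b_h, b_v):
    """Analytic gradients of F(v) w.r.t. W, b_h and b_v."""
    s = sigmoid(W @ v + b_h)
    grad_W = -np.outer(s, v)   # eq. (30): -sigma(W_i. v + b_i^h) v_j
    grad_b_h = -s              # eq. (31)
    grad_b_v = -v              # eq. (32)
    return grad_W, grad_b_h, grad_b_v


def numerical_grad(f, x, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. array x (modified in place)."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f()
        x[idx] = old - eps
        f_minus = f()
        x[idx] = old
        g[idx] = (f_plus - f_minus) / (2.0 * eps)
    return g
```

A typical check draws random `W`, `b_h`, `b_v` and a binary `v`, then compares `free_energy_grads` with `numerical_grad(lambda: free_energy(v, W, b_h, b_v), W)`, and likewise for the two bias vectors.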


Partial derivatives related to the oRBM and the iRBM 


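Recall equation (19) representing the free energy of the oRBM:

$$F(\mathbf{v}) = -\ln \sum_{z=1}^{K} e^{-F(\mathbf{v}, z)}$$

where

$$F(\mathbf{v}, z) = -\mathbf{v}^\top \mathbf{b}^v - \sum_{i=1}^{z} \left( \mathrm{soft}_+(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h) - \beta\, \mathrm{soft}_+(b_i^h) \right)$$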
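
The partial derivatives of F(v, z) w.r.t. $W_{ij}$, $b_i^h$ and $b_j^v$ are similar to equations (30),
(31) and (32) from the RBM and are respectively given by

$$\frac{\partial F(\mathbf{v}, z)}{\partial W_{ij}} = -\sum_{k=1}^{z} \frac{\partial\, \mathrm{soft}_+(\mathbf{W}_{k\cdot}\mathbf{v} + b_k^h)}{\partial W_{ij}} = -\sum_{k=1}^{z} \sigma(\mathbf{W}_{k\cdot}\mathbf{v} + b_k^h)\, \frac{\partial\, \mathbf{W}_{k\cdot}\mathbf{v}}{\partial W_{ij}} = -H(z - i)\, \sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h)\, v_j \tag{33}$$

$$\frac{\partial F(\mathbf{v}, z)}{\partial b_i^h} = -\sum_{k=1}^{z} \frac{\partial \left( \mathrm{soft}_+(\mathbf{W}_{k\cdot}\mathbf{v} + b_k^h) - \beta\, \mathrm{soft}_+(b_k^h) \right)}{\partial b_i^h} = -\sum_{k=1}^{z} \left( \sigma(\mathbf{W}_{k\cdot}\mathbf{v} + b_k^h) - \beta\, \sigma(b_k^h) \right) \frac{\partial b_k^h}{\partial b_i^h} = -H(z - i) \left( \sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h) - \beta\, \sigma(b_i^h) \right) \tag{34}$$

$$\frac{\partial F(\mathbf{v}, z)}{\partial b_j^v} = -\frac{\partial\, \mathbf{v}^\top \mathbf{b}^v}{\partial b_j^v} = -v_j \tag{35}$$

with the Heaviside step function denoted as

$$H(n) = \begin{cases} 0, & n < 0 \\ 1, & n \geq 0 \end{cases}$$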

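Then, the partial derivatives of F(v) w.r.t. $W_{ij}$, $b_i^h$ and $b_j^v$ are obtained
respectively as follows:

$$\begin{aligned}
\frac{\partial F(\mathbf{v})}{\partial W_{ij}}
&= \sum_{z=1}^{K} \frac{e^{-F(\mathbf{v}, z)}}{\sum_{z'=1}^{K} e^{-F(\mathbf{v}, z')}}\, \frac{\partial F(\mathbf{v}, z)}{\partial W_{ij}} \\
&= -\sum_{z=1}^{K} P(z \mid \mathbf{v})\, H(z - i)\, \sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h)\, v_j \\
&= -\sum_{z=1}^{K} H(z - i)\, P(z \mid \mathbf{v})\, P(h_i = 1 \mid \mathbf{v}, z)\, v_j \\
&= -\sum_{z=1}^{K} \sum_{h_i \in \{0,1\}} H(z - i)\, P(z \mid \mathbf{v})\, P(h_i \mid \mathbf{v}, z)\, h_i\, v_j \\
&= -\mathbb{E}_{h_i, z}\left[ H(z - i)\, h_i \mid \mathbf{v} \right] v_j
\end{aligned} \tag{36}$$

$$\begin{aligned}
\frac{\partial F(\mathbf{v})}{\partial b_i^h}
&= \sum_{z=1}^{K} \frac{e^{-F(\mathbf{v}, z)}}{\sum_{z'=1}^{K} e^{-F(\mathbf{v}, z')}}\, \frac{\partial F(\mathbf{v}, z)}{\partial b_i^h} \\
&= -\sum_{z=1}^{K} P(z \mid \mathbf{v})\, H(z - i) \left( \sigma(\mathbf{W}_{i\cdot}\mathbf{v} + b_i^h) - \beta\, \sigma(b_i^h) \right) \\
&= -\sum_{z=1}^{K} H(z - i)\, P(z \mid \mathbf{v}) \left( P(h_i = 1 \mid \mathbf{v}, z) - \beta\, \sigma(b_i^h) \right) \\
&= -\sum_{z=1}^{K} H(z - i)\, P(z \mid \mathbf{v}) \left( -\beta\, \sigma(b_i^h)\, P(h_i = 0 \mid \mathbf{v}, z) + (1 - \beta\, \sigma(b_i^h))\, P(h_i = 1 \mid \mathbf{v}, z) \right) \\
&= -\sum_{z=1}^{K} H(z - i)\, P(z \mid \mathbf{v}) \left( (0 - \beta\, \sigma(b_i^h))\, P(h_i = 0 \mid \mathbf{v}, z) + (1 - \beta\, \sigma(b_i^h))\, P(h_i = 1 \mid \mathbf{v}, z) \right) \\
&= -\sum_{z=1}^{K} \sum_{h_i \in \{0,1\}} H(z - i) \left( h_i - \beta\, \sigma(b_i^h) \right) P(z \mid \mathbf{v})\, P(h_i \mid \mathbf{v}, z) \\
&= -\mathbb{E}_{h_i, z}\left[ H(z - i) \left( h_i - \beta\, \sigma(b_i^h) \right) \mid \mathbf{v} \right]
\end{aligned} \tag{37}$$

$$\frac{\partial F(\mathbf{v})}{\partial b_j^v} = -\frac{\partial\, \mathbf{v}^\top \mathbf{b}^v}{\partial b_j^v} = -v_j$$

Observe that in equations (36) and (37), $\sum_{z=1}^{K} P(z \mid \mathbf{v})\, H(z - i)$ corresponds to
$P(z \geq i \mid \mathbf{v})$. This then translates to a (1 − cdf) term, i.e. $1 - P(z < i \mid \mathbf{v})$, when
deriving the gradients as shown in equations (23) and (24).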
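
The role of P(z ≥ i | v) in the oRBM gradients can be illustrated with a small NumPy sketch. The names and array layout are assumptions made for this example. It computes P(z|v) from the running sum that defines -F(v, z), obtains the tail probabilities with a reversed cumulative sum, and returns the gradients of F(v) in the collapsed form of equations (36) and (37).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softplus(x):
    return np.logaddexp(0.0, x)


def neg_f_vz(v, W, b_h, b_v, beta):
    """Vector of -F(v, z) for z = 1..K."""
    return v @ b_v + np.cumsum(softplus(W @ v + b_h) - beta * softplus(b_h))


def free_energy_orbm(v, W, b_h, b_v, beta):
    """F(v) = -ln sum_z exp(-F(v, z)), computed with a stable log-sum-exp."""
    return -np.logaddexp.reduce(neg_f_vz(v, W, b_h, b_v, beta))


def orbm_grads(v, W, b_h, b_v, beta):
    """Gradients of F(v) w.r.t. W, b_h and b_v using P(z >= i | v)."""
    act = W @ v + b_h
    neg_f = neg_f_vz(v, W, b_h, b_v, beta)
    p_z = np.exp(neg_f - np.logaddexp.reduce(neg_f))      # P(z | v)
    tail = np.cumsum(p_z[::-1])[::-1]                     # P(z >= i | v), i = 1..K
    grad_W = -np.outer(tail * sigmoid(act), v)            # collapsed eq. (36)
    grad_b_h = -tail * (sigmoid(act) - beta * sigmoid(b_h))  # collapsed eq. (37)
    grad_b_v = -v
    return grad_W, grad_b_h, grad_b_v


def numerical_grad(f, x, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. array x (modified in place)."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f()
        x[idx] = old - eps
        f_minus = f()
        x[idx] = old
        g[idx] = (f_plus - f_minus) / (2.0 * eps)
    return g
```

Comparing `orbm_grads` with `numerical_grad` applied to `free_energy_orbm` is a quick way to confirm that the Heaviside sum collapses to the tail probability.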

B Convergence of the partition function for the iRBM

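We will show that the partition function Z of the iRBM is finite. To do so, we
take the limit $K \to \infty$ of equation (16):

$$Z = \sum_{\mathbf{v} \in \mathcal{V}} \sum_{z=1}^{\infty} \sum_{\mathbf{h} \in \mathcal{H}_z} e^{-E(\mathbf{v}, \mathbf{h}, z)} = \sum_{\mathbf{v} \in \mathcal{V}} \sum_{z=1}^{\infty} e^{-F(\mathbf{v}, z)} = \sum_{\mathbf{v} \in \mathcal{V}} Z(\mathbf{v}) \tag{38}$$

Since the sum over $\mathbf{v} \in \mathcal{V}$ has finitely many terms and we know from equation (27)
that each $Z(\mathbf{v})$ is finite, Z is also finite.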
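
One way to build intuition for why Z(v) can be finite is the following sketch, which rests on two assumptions that are not restated in this appendix: all hidden units beyond the first l have zero weights and zero biases, and β > 1. Under these assumptions each extra unit adds (1 − β) ln 2 to −F(v, z), so the terms beyond z = l form a geometric series with ratio 2^(1−β) < 1. The function name and argument layout are hypothetical.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)


def log_z_of_v(v, W, b_h, b_v, beta):
    """ln Z(v) = ln sum_{z=1}^inf exp(-F(v, z)).

    Assumes units beyond the l = len(b_h) given ones have zero parameters,
    and beta > 1 so that the geometric tail converges.
    """
    if beta <= 1.0:
        raise ValueError("the geometric tail only converges for beta > 1")
    # -F(v, z) for z = 1..l
    neg_f = v @ b_v + np.cumsum(softplus(W @ v + b_h) - beta * softplus(b_h))
    # Each zero-parameter unit multiplies exp(-F) by r = 2 ** (1 - beta).
    r = 2.0 ** (1.0 - beta)
    log_tail = neg_f[-1] + np.log(r / (1.0 - r))   # ln sum_{z > l} exp(-F(v, z))
    return np.logaddexp.reduce(np.append(neg_f, log_tail))
```

The closed-form tail can be compared against an explicit sum over a model padded with a few hundred zero-parameter units; for β = 2 the two agree to numerical precision.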